Smoothing for Bracketing Induction
Authors
Abstract
Bracketing induction is the unsupervised learning of hierarchical constituents from raw natural-language sentences, without labeling their syntactic categories such as verb phrase (VP). The Constituent Context Model (CCM) is an effective generative model for bracketing induction, but it computes the probability of a constituent in the same direct way regardless of the constituent's length. This causes a severe data sparsity problem, because long constituents are unlikely to reappear in the test set. To overcome this problem, this paper proposes a non-parametric Bayesian prior distribution over constituents, namely the Pitman-Yor Process (PYP) prior, for constituent smoothing. The PYP prior functions as a back-off smoothing method through a hierarchical smoothing scheme (HSS). Various kinds of HSS are proposed in this paper. We find that two kinds of HSS are effective, attaining or significantly improving the state-of-the-art performance of bracketing induction evaluated on standard treebanks of various languages, while another kind of HSS, commonly used for smoothing sequences by n-gram Markovization, is not effective for improving the performance of the CCM.
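The abstract does not spell out the smoothing formula, so the following is a minimal, hypothetical Python sketch of how a PYP prior can act as back-off smoothing over constituents. It uses the common one-table-per-type approximation to the PYP predictive distribution (Kneser-Ney-style discounting); the class name, the POS-tag-unigram base distribution, and the parameter values are illustrative assumptions, not the paper's exact scheme.

```python
from collections import Counter

class PYPSmoother:
    """Pitman-Yor predictive probability with the common one-table-per-type
    approximation. Observed constituent counts are discounted, and the
    reclaimed probability mass backs off to a simpler base distribution,
    as in a hierarchical smoothing scheme."""

    def __init__(self, discount, strength, base):
        assert 0.0 <= discount < 1.0 and strength > -discount
        self.d = discount        # PYP discount parameter
        self.theta = strength    # PYP strength (concentration) parameter
        self.base = base         # back-off distribution: span -> probability
        self.counts = Counter()  # constituent span -> observed count
        self.n = 0               # total observations

    def observe(self, span):
        self.counts[span] += 1
        self.n += 1

    def prob(self, span):
        c = self.counts[span]
        t = len(self.counts)  # distinct spans seen (approximates table count)
        discounted = max(c - self.d, 0.0)
        backoff_mass = self.theta + self.d * t
        return (discounted + backoff_mass * self.base(span)) / (self.theta + self.n)


# Hypothetical base distribution: a product of POS-tag unigram
# probabilities, so long unseen spans still receive non-zero mass.
tag_probs = {"DT": 0.2, "NN": 0.3, "VB": 0.2, "JJ": 0.1}

def tag_unigram_base(span):
    p = 1.0
    for tag in span:
        p *= tag_probs.get(tag, 1e-6)
    return p

smoother = PYPSmoother(discount=0.5, strength=1.0, base=tag_unigram_base)
for span in [("DT", "NN"), ("DT", "NN"), ("DT", "JJ", "NN")]:
    smoother.observe(span)

print(smoother.prob(("DT", "NN")))        # seen span: discounted count + back-off
print(smoother.prob(("DT", "NN", "VB")))  # unseen span: mass comes from the base
```

In a hierarchical smoothing scheme, the base distribution at one PYP level would itself be another PYP smoother over a coarser representation of the span; the single tag-unigram base used here is only the simplest such back-off.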
Similar resources
Unsupervised Induction of Labeled Parse Trees by Clustering with Syntactic Features
We present an algorithm for unsupervised induction of labeled parse trees. The algorithm has three stages: bracketing, initial labeling, and label clustering. Bracketing is done from raw text using an unsupervised incremental parser. Initial labeling is done using a merging model that aims at minimizing the grammar description length. Finally, labels are clustered to a desired number of labels ...
Transformation Based Error Driven Parsing
In this paper we describe a new technique for parsing free text: a transformational grammar is automatically learned that is capable of accurately parsing text into binary-branching syntactic trees. The algorithm works by beginning in a very naive state of knowledge about phrase structure. By repeatedly comparing the results of bracketing in the current state to proper bracketing provided in the...
A Feature-Rich Constituent Context Model for Grammar Induction
We present LLCCM, a log-linear variant of the constituent context model (CCM) of grammar induction. LLCCM retains the simplicity of the original CCM but extends robustly to long sentences. On sentences of up to length 40, LLCCM outperforms CCM by 13.9% bracketing F1 and outperforms a right-branching baseline in regimes where CCM does not.
Capitalization Cues Improve Dependency Grammar Induction
We show that orthographic cues can be helpful for unsupervised parsing. In the Penn Treebank, transitions between upper- and lowercase tokens tend to align with the boundaries of base (English) noun phrases. Such signals can be used as partial bracketing constraints to train a grammar inducer: in our experiments, directed dependency accuracy increased by 2.2% (average over 14 languages having cas...
Use of Two Smoothing Parameters in Penalized Spline Estimator for Bi-variate Predictor Non-parametric Regression Model
Penalized spline criteria involve a goodness-of-fit function and a penalty function, where the penalty contains smoothing parameters. These serve to control the smoothness of the curve, working together with the knot points and the spline degree. A regression function with two predictors in a non-parametric model will have two different non-parametric regression functions. Therefore, we...
Journal:
Volume, Issue:
Pages: -
Publication date: 2013